The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of the project is to predict whether or not a client will repay a loan.
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who either cannot obtain loans or become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
There are 8 data files drawn from 7 different sources (the application data is split into train and test):
| S. No | Table Name | Rows | Features | Numerical Features | Categorical Features | Size |
|---|---|---|---|---|---|---|
| 1 | application_train | 307,511 | 122 | 106 | 16 | 158MB |
| 2 | application_test | 48,744 | 121 | 105 | 16 | 25MB |
| 3 | bureau | 1,716,428 | 17 | 14 | 3 | 162MB |
| 4 | bureau_balance | 27,299,925 | 3 | 2 | 1 | 358MB |
| 5 | credit_card_balance | 3,840,312 | 23 | 22 | 1 | 405MB |
| 6 | installments_payments | 13,605,401 | 8 | 8 | 0 | 690MB |
| 7 | previous_application | 1,670,214 | 37 | 21 | 16 | 386MB |
| 8 | POS_CASH_balance | 10,001,358 | 8 | 7 | 1 | 375MB |
The data download also includes a data dictionary, HomeCredit_columns_description.csv, which describes every field in all of the tables above (i.e., the metadata).
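The supplementary tables link back to the applications through shared keys (SK_ID_CURR for clients, SK_ID_PREV for previous applications, SK_ID_BUREAU for bureau records). A minimal sketch of the usual pattern, aggregating a child table per client and left-joining it onto the applications, using made-up rows rather than real HCDR data:

```python
import pandas as pd

# Made-up rows for illustration only -- not real HCDR data.
applications = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "AMT_CREDIT": [200000, 150000, 300000],
})
bureau = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2, 3],
    "SK_ID_CURR": [100001, 100001, 100003],
    "AMT_CREDIT_SUM": [50000, 25000, 10000],
})

# Aggregate the child table per client, then left-join onto the applications
bureau_agg = (
    bureau.groupby("SK_ID_CURR", as_index=False)["AMT_CREDIT_SUM"]
    .sum()
    .rename(columns={"AMT_CREDIT_SUM": "BUREAU_CREDIT_SUM"})
)
merged = applications.merge(bureau_agg, on="SK_ID_CURR", how="left")
# Client 100002 has no bureau records, so its BUREAU_CREDIT_SUM is NaN
```

The left join keeps every application row even when a client has no bureau history.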
To use the Kaggle API: install the kaggle library, download your kaggle.json API token file, and place kaggle.json in the right place (~/.kaggle/). For more detailed information on setting up the Kaggle API, see here and here.
!pip install kaggle
!pwd
!ls -l ~/.kaggle/kaggle.json
json_file_not_exists = True  # Change this to False if you already have your kaggle.json in place
if json_file_not_exists:
    !mkdir -p ~/.kaggle
    !cp kaggle.json ~/.kaggle
    !chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions files home-credit-default-risk
Create a base directory:
DATA_DIR = "../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
The Download button on the competition's Data webpage; unzip the zip file into the DATA_DIR defined above ("../Data/home-credit-default-risk"). Alternatively, use the Kaggle CLI as shown below.
!mkdir $DATA_DIR
!ls -l $DATA_DIR
data_not_downloaded = True  # Change this to False if you already have the data
if data_not_downloaded:
    !kaggle competitions download home-credit-default-risk -p $DATA_DIR --force
!pwd
!ls -l $DATA_DIR
#!rm -r $DATA_DIR
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
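The scikit-learn imports above anticipate a preprocessing-plus-model pipeline. A minimal sketch on a tiny synthetic matrix (the numbers are made up, not HCDR values) of how SimpleImputer, StandardScaler, and LogisticRegression compose:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic feature matrix with a missing value
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 1.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs with column medians
    ("scale", StandardScaler()),                   # standardize to zero mean, unit variance
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Fitting the imputer and scaler inside the pipeline keeps their statistics tied to the training fold, which matters once cross_val_score or GridSearchCV is applied later.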
unzippingReq = False  # True if not yet unzipped
if unzippingReq:
    # extractall() extracts all members of the archive; path specifies the target directory
    with zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r') as zip_ref:
        zip_ref.extractall(DATA_DIR)
# let's store the datasets in a dictionary so we can keep track of them easily
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    df.info()
    display(df.head(5))
    return df
DATA_DIR = "../Data/home-credit-default-risk"
datasets = {}
# %%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance", "credit_card_balance",
            "installments_payments", "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
def plot_missing_data(df_name, x, y):
    # Stacked fill plot of missing vs. present values per column for the named dataset
    g = sns.displot(
        data=datasets[df_name].isna().melt(value_name="missing"),
        y="variable",
        hue="missing",
        multiple="fill",
        aspect=1.25
    )
    g.fig.set_figwidth(x)
    g.fig.set_figheight(y)
datasets["application_train"].info()
datasets["application_train"].columns
datasets["application_train"].dtypes
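The numerical/categorical feature counts quoted in the table earlier can be derived from these dtypes with select_dtypes; a sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame mimicking the mixed dtypes of application_train (values are made up)
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0],   # float -> numerical
    "CNT_CHILDREN": [0, 2],         # int   -> numerical
    "CODE_GENDER": ["M", "F"],      # str   -> categorical
})
n_numerical = toy.select_dtypes(include="number").shape[1]
n_categorical = toy.select_dtypes(include="object").shape[1]
```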
datasets["application_train"].describe() #numerical only features
datasets["application_train"].describe(include='all')
datasets["application_train"].corr()
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
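Since the percent/count computation above is repeated for every table below, it can be factored into a helper; a sketch (the function name is ours) demonstrated on a toy DataFrame:

```python
import pandas as pd

def missing_stats(df):
    """Percent and count of missing values per column, sorted by percent descending."""
    percent = (df.isna().mean() * 100).round(2)  # isna().mean() equals isnull().sum()/count()
    count = df.isna().sum()
    out = pd.concat([percent, count], axis=1, keys=["Percent", "Missing Count"])
    return out.sort_values("Percent", ascending=False)

toy = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4]})
stats = missing_stats(toy)  # column "a" is 50% missing, "b" is complete
```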
plot_missing_data("application_train",18,20)
datasets["application_test"].info()
datasets["application_test"].columns
datasets["application_test"].dtypes
datasets["application_test"].describe() #numerical only features
datasets["application_test"].describe(include='all') #look at all categorical and numerical
datasets["application_test"].corr()
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
plot_missing_data("application_test",18,20)
datasets["bureau"].info()
datasets["bureau"].columns
datasets["bureau"].dtypes
datasets["bureau"].describe()
datasets["bureau"].describe(include='all')
datasets["bureau"].corr()
percent = (datasets["bureau"].isnull().sum()/datasets["bureau"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau"].isna().sum().sort_values(ascending = False)
missing_bureau_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_bureau_data.head(20)
plot_missing_data("bureau",18,20)
datasets["bureau_balance"].info()
datasets["bureau_balance"].columns
datasets["bureau_balance"].dtypes
datasets["bureau_balance"].describe()
datasets["bureau_balance"].describe(include='all')
datasets["bureau_balance"].corr()
percent = (datasets["bureau_balance"].isnull().sum()/datasets["bureau_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["bureau_balance"].isna().sum().sort_values(ascending = False)
missing_bureau_balance_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_bureau_balance_data.head(20)
plot_missing_data("bureau_balance",18,20)
datasets["POS_CASH_balance"].info()
datasets["POS_CASH_balance"].columns
datasets["POS_CASH_balance"].dtypes
datasets["POS_CASH_balance"].describe()
datasets["POS_CASH_balance"].describe(include='all')
datasets["POS_CASH_balance"].corr()
percent = (datasets["POS_CASH_balance"].isnull().sum()/datasets["POS_CASH_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["POS_CASH_balance"].isna().sum().sort_values(ascending = False)
missing_pos_cash_balance_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_pos_cash_balance_data.head(20)
plot_missing_data("POS_CASH_balance",18,20)
datasets["credit_card_balance"].info()
datasets["credit_card_balance"].columns
datasets["credit_card_balance"].dtypes
datasets["credit_card_balance"].describe()
datasets["credit_card_balance"].describe(include='all')
datasets["credit_card_balance"].corr()
percent = (datasets["credit_card_balance"].isnull().sum()/datasets["credit_card_balance"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["credit_card_balance"].isna().sum().sort_values(ascending = False)
missing_credit_card_balance_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_credit_card_balance_data.head(20)
plot_missing_data("credit_card_balance",18,20)
datasets["previous_application"].info()
datasets["previous_application"].columns
datasets["previous_application"].dtypes
datasets["previous_application"].describe()
datasets["previous_application"].describe(include='all')
datasets["previous_application"].corr()
percent = (datasets["previous_application"].isnull().sum()/datasets["previous_application"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["previous_application"].isna().sum().sort_values(ascending = False)
missing_previous_application_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_previous_application_data.head(20)
plot_missing_data("previous_application",18,20)
datasets["installments_payments"].info()
datasets["installments_payments"].columns
datasets["installments_payments"].dtypes
datasets["installments_payments"].describe()
datasets["installments_payments"].describe(include='all')
datasets["installments_payments"].corr()
percent = (datasets["installments_payments"].isnull().sum()/datasets["installments_payments"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["installments_payments"].isna().sum().sort_values(ascending = False)
missing_installments_payments_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Missing Count"])
missing_installments_payments_data.head(20)
plot_missing_data("installments_payments",18,20)
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data=missing_application_train_data.reset_index().rename(columns={'index':'Attributes'})
missing_application_train_data
plt.figure(figsize = (30, 5))
sns.barplot(x='Attributes',y='Percent',data=missing_application_train_data[missing_application_train_data.Percent>0], palette = ['green'])
plt.xlabel('Attributes');
plt.ylabel('Percentage of missing values %');
plt.title('Percentage values of missing entries in Attributes');
plt.xticks(rotation=90);
plt.show()
null_data_percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing_data = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_app_test_data = pd.concat([null_data_percent, sum_missing_data], axis=1, keys=['Percent', "Test Missing Count"])
missing_app_test_data
def col(cat):
    # Count plot of a categorical feature in application_train, split by TARGET
    df = datasets["application_train"]
    plt.figure(figsize=(10,10))
    plt.title("Loan Default with respect to " + cat, fontweight='bold', fontsize=16)
    sns.countplot(x=cat, hue='TARGET', data=df, palette='Blues')
    plt.xticks(rotation=90)
print(datasets["application_train"]['CODE_GENDER'].value_counts())
sns.countplot(x=datasets["application_train"]['CODE_GENDER'], palette = 'Oranges')
plt.title("Percentage of loan with reference to gender", fontweight = 'bold', fontsize = 16)
The number of female borrowers who have not repaid their loans is comparatively higher than that of male borrowers.
sns.catplot(data = datasets["application_train"], x='TARGET', kind = 'count')
plt.xlabel('Target');
plt.ylabel('Numbers of Borrowers');
plt.title('Target values against the number of borrowers');
plt.show()
Many people would rather take out a cash loan than a revolving loan.
datasets["application_train"]['TARGET'].astype(int).plot.hist();
plt.figure(figsize = (5, 5))
sns.histplot(datasets["application_train"].AMT_CREDIT, kde=True, color = 'maroon')
plt.xlabel('Amount Credit');
plt.ylabel('Density distribution');
plt.title('Amount Credit against the density');
plt.show()
plt.figure(figsize = (5, 5))
sns.boxplot(data = datasets["application_train"], x = 'AMT_INCOME_TOTAL', color = 'green')
plt.xlim(0,1000000)
plt.xlabel('Income Total Amount');
plt.title('Distribution of Income Total Amount');
plt.show()
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==1],x='NAME_INCOME_TYPE',kind='count',hue="TARGET", palette = ['red'])
plt.xlabel('Income types')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Income Types')
plt.xticks(rotation=75)
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==0],x='NAME_INCOME_TYPE',kind='count',hue="TARGET", palette = ['purple'])
plt.xlabel('Income types')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Income Types')
plt.xticks(rotation=75)
plt.figure(figsize = (5, 5))
sns.countplot(x=datasets["application_train"].CODE_GENDER, palette=sns.color_palette('bright')[:2])
plt.xlabel('Gender');
plt.ylabel('Number of Borrowers');
plt.title('Frequency of borrowers against Gender');
plt.show()
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==1],x='CODE_GENDER',kind='count',hue="TARGET",palette = ['orange']);
plt.xlabel('Gender Type')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Gender')
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==0],x='CODE_GENDER',kind='count',hue="TARGET",palette = ['cyan']);
plt.xlabel('Gender Type')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Gender')
plt.show()
Females account for more defaults, but note that the sample also contains far more females than males overall. Based on the raw counts in the given data, females appear to be the more frequent defaulters.
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.xticks(rotation=90);
print(datasets["application_train"]['NAME_FAMILY_STATUS'].value_counts())
sns.countplot(x=datasets["application_train"]['NAME_FAMILY_STATUS'], palette = 'Purples')
plt.title("Family Status vs Count", fontweight = 'bold', fontsize = 11)
The bulk of clients are married, while the number of clients with an unknown family status is negligible.
fig,ax = plt.subplots(figsize=(10,10))
sns.countplot(x='CNT_CHILDREN', hue = 'TARGET',data=datasets["application_train"], palette=['#432371',"#FAAE7B"])
plt.xlabel("Number of Children")
plt.ylabel('Numbers of borrowers')
plt.title('Number of borrowers against target based on children count');
plt.xticks(rotation=70)
plt.show()
It is observed that clients with no children account for the largest number of unpaid loans; as the number of children increases, the number of defaulters decreases.
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==1],x='NAME_FAMILY_STATUS',kind='count',hue="TARGET",palette = ['green'])
plt.xlabel('Family Status')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Family Status')
plt.xticks(rotation=75)
sns.catplot(data=datasets["application_train"][datasets["application_train"].TARGET==0],x='NAME_FAMILY_STATUS',kind='count',hue="TARGET", palette = ['yellow'])
plt.xlabel('Family Status')
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers against target value based on Family Status')
plt.xticks(rotation=75)
plt.show()
Married clients account for the most defaults, but note that the sample contains far more married clients than any other family status. Based on the raw counts in the given data, married people appear to be the most frequent defaulters.
print(datasets["application_train"]['FLAG_OWN_CAR'].value_counts())
sns.countplot(x=datasets["application_train"]['FLAG_OWN_CAR'], palette = 'Oranges')
plt.title("Percentage of car owners in the dataset", fontweight = 'bold', fontsize = 11)
Most clients do not own a car; car owners make up the smaller share of the dataset.
years = datasets["application_train"][['TARGET','DAYS_BIRTH']].copy()
years['YEARS_BIRTH']=years['DAYS_BIRTH']/-365
years['group']=pd.cut(years['YEARS_BIRTH'],bins=np.linspace(0,50,num=11))
age_groups = years.groupby('group').mean()
age_groups
plt.figure(figsize=(10,10))
plt.bar(age_groups.index.astype(str), 100*age_groups['TARGET'], color = 'Maroon')
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to repay (%)')
plt.title('Failure to repay the loan based on Age group')
plt.show()
It is observed that the 20-25 age group is the most likely to fail to repay their loans.
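The age-group binning above relies on pd.cut; a self-contained sketch with made-up ages showing how values fall into the 5-year intervals:

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for DAYS_BIRTH / -365
ages = pd.Series([22, 24, 31, 47, 38])
groups = pd.cut(ages, bins=np.linspace(0, 50, num=11))  # (0, 5], (5, 10], ..., (45, 50]
counts = groups.value_counts().sort_index()
# Both 22 and 24 land in the (20, 25] interval
```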
print(datasets["application_train"]['NAME_EDUCATION_TYPE'].value_counts())
sns.countplot(x=datasets["application_train"]['NAME_EDUCATION_TYPE'])
plt.title("Education type vs count")
plt.xticks(rotation=90)
Clients with an academic degree are the most likely to repay their loans compared to other education types. Defaulters most commonly hold a secondary/secondary special education, followed by holders of a higher education degree.
plt.figure(figsize=[20,15])
plt.pie(datasets["application_train"]['NAME_HOUSING_TYPE'].value_counts(),labels = datasets["application_train"]['NAME_HOUSING_TYPE'].value_counts().index,autopct='%1.1f%%')
my_circle=plt.Circle( (0,0), 0.5, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.show()
We can see from the chart above that the bulk of clients live in a house or apartment, while very few live in office apartments or co-op apartments.
# Count borrowers per housing type and target value; replaces the repetitive per-type
# value_counts()/append() blocks (DataFrame.append was removed in pandas 2.0)
HousingTypes = (
    datasets["application_train"]
    .groupby(['NAME_HOUSING_TYPE', 'TARGET'])
    .size()
    .reset_index(name='Borrowers_count')
)
HousingTypes
plt.figure(figsize = (15, 5))
sns.barplot(x='NAME_HOUSING_TYPE',y='Borrowers_count',hue = 'TARGET',data=HousingTypes[HousingTypes['TARGET']==1], palette = ['skyblue'])
plt.xlabel("Housing Types")
plt.ylabel('Numbers of borrowers')
plt.title('Number of borrowers against target based on housing types');
plt.show()
plt.figure(figsize = (15, 5))
sns.barplot(x='NAME_HOUSING_TYPE',y='Borrowers_count',hue = 'TARGET',data=HousingTypes[HousingTypes['TARGET']==0], palette = ['pink'] )
plt.xlabel("Housing Types")
plt.ylabel('Numbers of borrowers')
plt.title('Number of borrowers against target based on housing types');
plt.show()
It is observed that clients living in a house/apartment account for the largest number of unpaid loans.
print(datasets["application_train"]['NAME_CONTRACT_TYPE'].value_counts())
sns.countplot(x=datasets["application_train"]['NAME_CONTRACT_TYPE'], palette = 'Reds')
plt.title("Types of loan available", fontweight = 'bold', fontsize = 12)
fig,ax = plt.subplots(figsize=(8,8))
sns.countplot(x='TARGET', hue = 'OCCUPATION_TYPE',data=datasets["application_train"])
plt.xlabel("Target")
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers per occupation type for each target value')
plt.xticks(rotation=70)
plt.show()
The above graph shows the occupation types of the borrowers for each target value.
Income_credit = datasets["application_train"][['AMT_INCOME_TOTAL','AMT_CREDIT','TARGET']].copy()
Income_credit['Ratio'] = (Income_credit['AMT_INCOME_TOTAL']/Income_credit['AMT_CREDIT'])
Income_credit
def count_bins(df):
    # Count borrowers with TARGET == 0 in ten 0.1-wide bins of the Income/Credit ratio.
    # Vectorized rewrite of the original repeated if-blocks, which also tested `0 in count_dict`
    # for every bin instead of the bin's own key. Note: a ratio of exactly 1.0 falls outside
    # the last [0.9, 1.0) bin here.
    repaid = df[df["TARGET"] == 0]
    binned = pd.cut(repaid["Ratio"], bins=np.linspace(0, 1, 11), right=False, labels=range(10))
    return binned.value_counts().sort_index().to_dict()
ff = count_bins(Income_credit)
ratios = list(ff.keys())
count = list(ff.values())
AMT_INCOME_TOTAL_AMT_CREDIT = [i / 10 for i in ratios]
fig = plt.figure(figsize = (20, 5))
plt.bar(AMT_INCOME_TOTAL_AMT_CREDIT, count, color ='grey',width=0.08)
plt.xlim(0, 1)
plt.xlabel("Income/Credit")
plt.ylabel('Number of borrowers')
plt.title('Number of borrowers with the Income/credit Ratio for target value 0');
plt.show()
It is observed that for most borrowers, total income is roughly 10% of the credit amount (an Income/Credit ratio near 0.1).
corr_app_train = datasets["application_train"].corr()['TARGET'].sort_values()
corr_app_train = corr_app_train.reset_index().rename(columns={'index':'Attributes','TARGET':'Correlation'})
corr_app_train
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
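The pattern of ranking features by their correlation with TARGET can be seen on a toy frame (synthetic numbers, not HCDR values):

```python
import pandas as pd

# "a" rises with TARGET, "b" falls with it -- made-up values for illustration
toy = pd.DataFrame({
    "TARGET": [0, 0, 1, 1],
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4.0, 3.0, 2.0, 1.0],
})
corrs = toy.corr()["TARGET"].sort_values()
# TARGET's self-correlation (1.0) always sorts last
```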
plt.figure(figsize = (10, 5))
sns.barplot(x='Attributes',y='Correlation',data= corr_app_train[corr_app_train.Correlation>0], palette = ['grey'])
plt.xlabel('Attributes')
plt.ylabel('Positive Correlation')
plt.title('Positive Correlated attributes with target')
plt.xticks(rotation=90)
plt.show()
The above graph shows the features that are positively correlated with the target.
plt.figure(figsize = (30, 5))
sns.barplot(x='Attributes',y='Correlation',data= corr_app_train[corr_app_train.Correlation<=0], palette = ['purple'])
plt.xlabel('Attributes')
plt.ylabel('Negative Correlation')
plt.title('Negative Correlated attributes with target')
plt.xticks(rotation=90)
plt.show()
The above graph shows the features that are negatively correlated with the target.
from pandas.plotting import scatter_matrix
#We can take the top 10 features
top_corr_features = ["TARGET", "REGION_RATING_CLIENT","REGION_RATING_CLIENT_W_CITY","DAYS_LAST_PHONE_CHANGE",
"DAYS_BIRTH", "EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_ID_PUBLISH","REG_CITY_NOT_WORK_CITY"]
# scatter_matrix(datasets["application_train"][top_corr_features], figsize=(12, 8));
df = datasets["application_train"].copy()
df2 = df[top_corr_features]
corr = df2.corr()
corr.style.background_gradient(cmap='PuBu').format(precision=2)
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
high_correlations = correlations.tail(15)
low_correlations = correlations.head(15)
print('Most positive correlations:\n', high_correlations)
print('Most negative correlations:\n', low_correlations)
most_corr=datasets["application_train"][['REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY','DAYS_EMPLOYED','DAYS_BIRTH','TARGET']]
most_corr_corr = most_corr.corr()
sns.set_style("dark")
sns.set_context("notebook", font_scale=2.0, rc={"lines.linewidth": 1.0})
fig, axes = plt.subplots(figsize = (20,10),sharey=True)
sns.heatmap(most_corr_corr,cmap=plt.cm.RdYlBu_r,vmin=-0.25,vmax=0.6,annot=True)
plt.title('Correlation Heatmap for features with highest correlations with target variables')
from scipy import stats
#import latexify
import time
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
import json
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from scipy import stats
from sklearn.svm import SVC
import warnings
from pprint import pprint
warnings.filterwarnings('ignore')
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    df.info()
    display(df.head(5))
    return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
DATA_DIR = "../Data/home-credit-default-risk"
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
%%time
# Define the list of dataset names to load
ds_names = ["application_train", "application_test", "bureau", "bureau_balance",
"credit_card_balance", "installments_payments", "previous_application",
"POS_CASH_balance"]
# Load each dataset and add it to the `datasets` dictionary
start_time = time.time()
for ds_name in ds_names:
    ds_path = os.path.join(DATA_DIR, f"{ds_name}.csv")
    datasets[ds_name] = load_data(ds_path, ds_name)
    print(f"{ds_name} dataset loaded successfully!")
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Data loading completed in {elapsed_time:.2f} seconds.")
for ds_name in datasets.keys():
    shape_str = f"[ {datasets[ds_name].shape[0]:,}, {datasets[ds_name].shape[1]}]"
    print(f"Dataset {ds_name:24}: {shape_str}")
datasets["application_train"].info()
datasets["application_train"].describe().T #numerical only features
datasets["application_test"].describe().T #numerical only features
datasets["application_train"].describe(include='all').T #look at all categorical and numerical
application_train = datasets['application_train']
application_train.dtypes
application_train.duplicated().sum()
application_train.isna().sum()
application_train.value_counts()
application_train.corr()
datasets['bureau_balance'].info()
datasets['bureau_balance'].describe().T
datasets['credit_card_balance'].info()
datasets['credit_card_balance'].describe().T
datasets['installments_payments'].info()
datasets['installments_payments'].describe().T
datasets['previous_application'].info()
datasets['previous_application'].describe()
datasets['POS_CASH_balance'].info()
datasets['POS_CASH_balance'].describe()
# Compute the percentage of missing values and the count of missing values for the "application_train" dataset
application_train_missing_percent = (datasets["application_train"].isnull().sum() / datasets["application_train"].isnull().count() * 100).sort_values(ascending=False).round(2)
application_train_missing_count = datasets["application_train"].isna().sum().sort_values(ascending=False)
# Combine the percentage and count of missing values into a DataFrame
application_train_missing_data = pd.concat([application_train_missing_percent, application_train_missing_count], axis=1, keys=["Percent Missing", "Count Missing"])
# Display the first 20 rows of the missing data statistics
print("Missing Data in Application Train Dataset")
print(application_train_missing_data.head(20))
# Reset the index of the missing data DataFrame and rename the index column to "Attributes"
application_train_missing_data = application_train_missing_data.reset_index().rename(columns={"index": "Attribute"})
# Display the updated missing data DataFrame
print("Missing Data in Application Train Dataset (Updated)")
print(application_train_missing_data.head(20))
# Compute the percentage of missing values and the count of missing values for the "application_test" dataset
application_test_missing_percent = (datasets["application_test"].isnull().sum() / datasets["application_test"].isnull().count() * 100).sort_values(ascending=False).round(2)
application_test_missing_count = datasets["application_test"].isna().sum().sort_values(ascending=False)
# Combine the percentage and count of missing values into a DataFrame
application_test_missing_data = pd.concat([application_test_missing_percent, application_test_missing_count], axis=1, keys=["Percent Missing", "Count Missing"])
# Display the first 20 rows of the missing data statistics
print("Missing Data in Application Test Dataset")
print(application_test_missing_data.head(20))
# Reset the index of the missing data DataFrame and rename the index column to "Attributes"
application_test_missing_data = application_test_missing_data.reset_index().rename(columns={"index": "Attribute"})
# Display the updated missing data DataFrame
print("Missing Data in Application Test Dataset (Updated)")
print(application_test_missing_data.head(20))
plt.figure(figsize = (20, 8), dpi = 200)
plt.hist(datasets["application_train"]['TARGET'].astype(int))
plt.title("Target: Default Indicator")
sns.catplot(data = datasets["application_train"], x='TARGET', kind = 'count')
plt.xlabel('Target');
plt.ylabel('Numbers of Borrowers');
plt.title('Target values against the number of borrowers');
plt.show()
list(datasets.keys())
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
# is there an overlap between the test and train customers
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
datasets["application_test"].shape
datasets["application_train"].shape
# Display the first five rows of the "previous_application" dataset and print its shape
previous_application_df = datasets["previous_application"]
print("Previous Application Dataset:")
display(previous_application_df.head())
print(f"Rows: {previous_application_df.shape[0]:,}, Columns: {previous_application_df.shape[1]:,}")
# Print the number of previous applications
previous_applications_df = datasets["previous_application"]
num_previous_applications = previous_applications_df.shape[0]
print(f"There are {num_previous_applications:,} previous applications.")
#print(f"There are {appsDF.shape[0]:,} previous applications")
# Compute statistics on the number of previous applications per customer
previous_applications_df = datasets["previous_application"]
num_customers = previous_applications_df["SK_ID_CURR"].nunique()
num_apps = [5, 10, 40, 60]
num_apps_counts = [np.sum(previous_applications_df["SK_ID_CURR"].value_counts() >= n) for n in num_apps]
percent_apps = [np.round(100. * (count / num_customers), 2) for count in num_apps_counts]
# Print the percentage of customers with 5, 10, 40, and 60 or more previous applications
for n, percent in zip(num_apps, percent_apps):
    print(f"Percentage of customers with {n} or more previous applications: {percent:.2f}%")
# Filter the training data to include only the observations with a target value of 1
unpaid_application_train = application_train[application_train['TARGET'] == 1]
# Sample 75,000 observations from the training data with a target value of 0 and append them to the filtered data
paid_application_train = application_train[application_train['TARGET'] == 0].reset_index(drop=True).sample(n=75000)
und_application_train = pd.concat([unpaid_application_train, paid_application_train])
# Add the undersampled training data to the dictionary of datasets
datasets["undersampled_application_train"] = und_application_train
# Print the number of observations with a target value of 0 and 1 in the undersampled training data
target_counts = und_application_train['TARGET'].value_counts()
print(f"Number of observations with a target value of 0: {target_counts[0]}")
print(f"Number of observations with a target value of 1: {target_counts[1]}")
# Create a copy of the training data containing only observations with a target value of 1 and add a weight column
undersampled_application_train_2 = datasets["application_train"][datasets["application_train"]["TARGET"] == 1].copy()
undersampled_application_train_2["weight"] = 1
# Count the number of default loans for cash loans and revolving loans separately
num_default_cash_loans = undersampled_application_train_2[(undersampled_application_train_2["TARGET"] == 1) & (undersampled_application_train_2["NAME_CONTRACT_TYPE"] == "Cash loans")].shape[0]
num_default_revolving_loans = undersampled_application_train_2[(undersampled_application_train_2["TARGET"] == 1) & (undersampled_application_train_2["NAME_CONTRACT_TYPE"] == "Revolving loans")].shape[0]
# Add the undersampled training data to the dictionary of datasets
datasets["undersampled_application_train_2"] = undersampled_application_train_2
# Undersample cash loans from the training data with a target value of 0 to balance the number of default loans for cash loans and revolving loans
cash_loans_target_0 = datasets["application_train"][(datasets["application_train"]["NAME_CONTRACT_TYPE"] == "Cash loans") & (datasets["application_train"]["TARGET"] == 0)]
cash_loans_target_0_sample = cash_loans_target_0.sample(n=int(1.5*num_default_cash_loans), random_state=1)
cash_loans_target_0_sample_weight = cash_loans_target_0.shape[0] / int(1.5*num_default_cash_loans)
cash_loans_target_0_sample["weight"] = cash_loans_target_0_sample_weight
undersampled_application_train_2 = pd.concat([datasets["undersampled_application_train_2"], cash_loans_target_0_sample])
# Add the undersampled training data to the dictionary of datasets
datasets["undersampled_application_train_2"] = undersampled_application_train_2
# Undersample revolving loans from the training data with a target value of 0 to balance the number of default loans for cash loans and revolving loans
revolving_loans_target_0 = datasets["application_train"][(datasets["application_train"]["NAME_CONTRACT_TYPE"] == "Revolving loans") & (datasets["application_train"]["TARGET"] == 0)]
revolving_loans_target_0_sample = revolving_loans_target_0.sample(n=int(1.5*num_default_revolving_loans), random_state=1)
revolving_loans_target_0_sample_weight = revolving_loans_target_0.shape[0] / int(1.5*num_default_revolving_loans)
revolving_loans_target_0_sample["weight"] = revolving_loans_target_0_sample_weight
undersampled_application_train_2 = pd.concat([datasets["undersampled_application_train_2"], revolving_loans_target_0_sample])
# Add the undersampled training data to the dictionary of datasets
datasets["undersampled_application_train_2"] = undersampled_application_train_2
# Create a copy of the undersampled training data and store it in a new variable
undersampled_application_train_2 = datasets["undersampled_application_train_2"].copy()
# Count the number of samples in the undersampled training data with a target value of 1 and 0
undersampled_application_train_2_target_counts = undersampled_application_train_2["TARGET"].value_counts()
undersampled_application_train_2_target_counts
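The "weight" column built above is only useful if it reaches the model. As a hedged illustration (synthetic data, made-up weights, not the HCDR frames): scikit-learn estimators accept a `sample_weight` argument at fit time, so each retained majority-class row can count as the number of original rows it stands in for.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch: a "weight" column like the one built above can be
# passed to fit() as sample_weight, so each retained majority-class row
# counts as the number of original rows it represents after undersampling.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.3).astype(int)  # imbalanced toy labels
w = np.where(y == 0, 5.0, 1.0)           # illustrative weights only

clf = LogisticRegression().fit(X, y, sample_weight=w)
print(clf.predict_proba(X[:1]).shape)    # one row, two class probabilities
```

The same effect can often be had with `class_weight` on the estimator; the explicit weight column simply makes the correction factor visible in the data.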
# Calculate the correlation between each column and the target variable in the application training data
application_train_correlations = datasets["application_train"].corr()["TARGET"].sort_values()
# Print the 10 features with the highest positive correlation with the target variable and the 10 features with the highest negative correlation with the target variable
print("Features with the highest positive correlation with the target variable:\n", application_train_correlations.tail(10))
print("\nFeatures with the highest negative correlation with the target variable:\n", application_train_correlations.head(10))
# Calculate the correlation between each column and the target variable in the application training data
application_train_correlations = datasets["application_train"].corr()["TARGET"].sort_values()
# Reset the index and rename the columns of the correlation DataFrame to more descriptive names
application_train_correlations = application_train_correlations.reset_index().rename(columns={"index": "Attribute", "TARGET": "Correlation"})
# Display the correlation DataFrame
display(application_train_correlations)
# Create a DataFrame containing the target variable and the client's age in years
age_data = datasets["undersampled_application_train_2"][["TARGET", "DAYS_BIRTH"]].copy()
age_data["YEARS"] = age_data["DAYS_BIRTH"] / -365  # DAYS_BIRTH is recorded as negative days
# Bin the ages into 9 evenly spaced bins between 20 and 70 years old
age_data["AGE_BIN"] = pd.cut(age_data["YEARS"], bins=np.linspace(20, 70, num=10))
# Display the first 15 rows of the age DataFrame
age_data.head(15)
# Group the age DataFrame by the age bins and calculate the mean target variable and age in years for each bin
age_grouped = age_data.groupby("AGE_BIN").mean()
# Display the grouped DataFrame
age_grouped
In the case of the HCDR competition (and many other machine learning problems involving multiple tables, in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table yields many new features about each loan application; these tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?
Joining previous_application with application_x. We refer to the application_train data (and likewise application_test) as the primary table and the other files (e.g., the previous_application dataset) as secondary tables. The secondary tables can be joined to the primary table using the key SK_ID_CURR; SK_ID_PREV identifies individual rows within previous_application.
Let's assume we wish to generate a feature based on previous application attempts. Possible features here include aggregates of AMT_APPLICATION and AMT_CREDIT (average, min, max, median, etc.). To build such features, we need to join the application_train data (and the application_test data) with the previous_application dataset (and the other available datasets).
When joining this data in the context of pipelines, different strategies come to mind, with various tradeoffs:
One strategy is to join the aggregated secondary tables with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via the machine learning pipeline. This approach is recommended for this HCDR competition: the aggregates are computed per customer from historical records rather than from the target, so performing the join before partitioning does not leak label information, and it guarantees that the train and test sets share an identical feature schema.
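The aggregate-then-left-join strategy can be sketched on toy frames (hypothetical values, not the real HCDR tables) as follows:

```python
import pandas as pd

# Toy stand-ins for the primary and secondary tables (hypothetical values)
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 1, 2],
                     "AMT_APPLICATION": [100.0, 300.0, 50.0]})

# Aggregate the secondary table down to one row per customer...
agg = (prev.groupby("SK_ID_CURR")["AMT_APPLICATION"]
           .agg(["min", "max", "mean"])
           .add_prefix("PREV_AMT_APPLICATION_")
           .reset_index())

# ...then left-join onto the primary table; customers with no previous
# applications (SK_ID_CURR = 3 here) get NaNs, to be imputed downstream.
joined = app.merge(agg, on="SK_ID_CURR", how="left")
print(joined.shape)  # (3, 5)
```

A left join is essential here: an inner join would silently drop applicants with no history in the secondary table.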
# Calculate the number of missing values in each column of the previous applications DataFrame
previous_applications_missing_values = datasets["previous_application"].isna().sum()
# Display the missing value counts
previous_applications_missing_values
# Define a list of features to include in the analysis
features_to_analyze = ["AMT_ANNUITY", "AMT_APPLICATION"]
# Calculate descriptive statistics for the selected features in the previous applications DataFrame
feature_stats = datasets["previous_application"][features_to_analyze].describe()
# Display the descriptive statistics
print(feature_stats)
# Define a list of aggregation operations to perform on the previous applications DataFrame
aggregation_operations = ["min", "max", "mean"]
# Group the previous applications DataFrame by client ID and calculate the mean of each group for the selected features
grouped_results = datasets["previous_application"].groupby("SK_ID_CURR")[features_to_analyze].agg(aggregation_operations)
# Display the first 5 rows of the grouped results DataFrame
display(grouped_results.head())
# Check for missing values in the grouped and aggregated results DataFrame
print(grouped_results.isna().sum())
# Create aggregate features (via pipeline)
class FeaturesAggregater(BaseEstimator, TransformerMixin):
    def __init__(self, features=None, agg_needed=["mean"]):  # no *args or **kwargs
        self.features = features
        self.agg_needed = agg_needed
        self.agg_op_features = {}
        for f in features:
            self.agg_op_features[f] = self.agg_needed[:]

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
        df_result = pd.DataFrame()
        for x1, x2 in result.columns:
            new_col = x1 + "_" + x2
            df_result[new_col] = result[x1][x2]
        df_result = df_result.reset_index(level=["SK_ID_CURR"])
        return df_result
datasets["previous_application"].isna().sum()
from sklearn.pipeline import make_pipeline

previous_feature = ["AMT_APPLICATION", "AMT_CREDIT", "AMT_ANNUITY", "approved_credit_ratio",
                    "AMT_ANNUITY_credit_ratio", "Interest_ratio", "LTV_ratio", "SK_ID_PREV", "approved"]
agg_needed = ["min", "max", "mean", "count", "sum"]

def previous_feature_aggregation(df, feature, agg_needed):
    # application amount over approved credit ratio
    df['approved_credit_ratio'] = (df['AMT_APPLICATION'] / df['AMT_CREDIT']).replace(np.inf, 0)
    # installment over credit approved ratio
    df['AMT_ANNUITY_credit_ratio'] = (df['AMT_ANNUITY'] / df['AMT_CREDIT']).replace(np.inf, 0)
    # total interest payment over credit ratio
    df['Interest_ratio'] = (df['AMT_ANNUITY'] / df['AMT_CREDIT']).replace(np.inf, 0)
    # loan cover ratio
    df['LTV_ratio'] = (df['AMT_CREDIT'] / df['AMT_GOODS_PRICE']).replace(np.inf, 0)
    df['approved'] = np.where(df.AMT_CREDIT > 0, 1, 0)
    test_pipeline = make_pipeline(FeaturesAggregater(feature, agg_needed))
    return test_pipeline.fit_transform(df)

datasets['previous_application_agg'] = previous_feature_aggregation(datasets["previous_application"], previous_feature, agg_needed)
datasets["previous_application_agg"].isna().sum()
datasets["installments_payments"].isna().sum()
payments_features = ["DAYS_INSTALMENT_DIFF", "AMT_PAYMENT_PCT"]
agg_needed = ["mean"]

def payments_feature_aggregation(df, feature, agg_needed):
    # days early (positive) or late (negative) relative to the due date
    df['DAYS_INSTALMENT_DIFF'] = df['DAYS_INSTALMENT'] - df['DAYS_ENTRY_PAYMENT']
    # fraction of each instalment actually paid
    df['AMT_PAYMENT_PCT'] = [x / y if (y != 0) & pd.notnull(y) else np.nan
                             for x, y in zip(df.AMT_PAYMENT, df.AMT_INSTALMENT)]
    test_pipeline = make_pipeline(FeaturesAggregater(feature, agg_needed))
    return test_pipeline.fit_transform(df)

datasets['installments_payments_agg'] = payments_feature_aggregation(datasets["installments_payments"], payments_features, agg_needed)
datasets["installments_payments_agg"].isna().sum()
datasets["credit_card_balance"].isna().sum()
credit_features = [
"AMT_BALANCE",
"AMT_DRAWINGS_PCT",
"AMT_DRAWINGS_ATM_PCT",
"AMT_DRAWINGS_OTHER_PCT",
"AMT_DRAWINGS_POS_PCT",
"AMT_PRINCIPAL_RECEIVABLE_PCT",
"CNT_DRAWINGS_ATM_CURRENT",
"CNT_DRAWINGS_CURRENT",
"CNT_DRAWINGS_OTHER_CURRENT",
"CNT_DRAWINGS_POS_CURRENT",
"SK_DPD",
"SK_DPD_DEF",
]
agg_needed = ["mean"]
def calculate_pct(x, y):
    return x / y if (y != 0) & pd.notnull(y) else np.nan

def credit_feature_aggregation(df, feature, agg_needed):
    # (source column, derived ratio column) pairs, each divided by the credit limit
    pct_columns = [
        ("AMT_DRAWINGS_CURRENT", "AMT_DRAWINGS_PCT"),
        ("AMT_DRAWINGS_ATM_CURRENT", "AMT_DRAWINGS_ATM_PCT"),
        ("AMT_DRAWINGS_OTHER_CURRENT", "AMT_DRAWINGS_OTHER_PCT"),
        ("AMT_DRAWINGS_POS_CURRENT", "AMT_DRAWINGS_POS_PCT"),
        ("AMT_RECEIVABLE_PRINCIPAL", "AMT_PRINCIPAL_RECEIVABLE_PCT"),
    ]
    for col_x, col_pct in pct_columns:
        df[col_pct] = [calculate_pct(x, y) for x, y in zip(df[col_x], df["AMT_CREDIT_LIMIT_ACTUAL"])]
    pipeline = make_pipeline(FeaturesAggregater(feature, agg_needed))
    return pipeline.fit_transform(df)
datasets["credit_card_balance_agg"] = credit_feature_aggregation(
datasets["credit_card_balance"], credit_features, agg_needed
)
datasets["credit_card_balance_agg"].isna().sum()
HCDR data preprocessing covered all application_train columns. The application_train subset was chosen because it constitutes the primary table, with 121 columns. To streamline the preprocessing phase, we selected seven of the most salient numerical and categorical features. The application_train table contains attributes such as the applicant's age, gender, and income, which are significant determinants for predicting loan default.
Following the preprocessing pipeline, we created two datasets: a training dataset, used to train the machine learning model, and a validation dataset, used to evaluate its performance. Splitting the preprocessed data this way ensured that the model was trained on a sufficiently large dataset while still being tested on enough data to assess its effectiveness.
HCDR data preprocessing and dataset creation are indispensable stages in constructing an accurate machine learning model to predict loan default.
# Load the train dataset
train_data = datasets["application_train"]
# Compute the distribution of the target variable
target_counts = train_data['TARGET'].value_counts()
# Display the target distribution
print("Target variable distribution:\n")
print(target_counts)
print("\n")
# Compute the percentage of positive and negative examples in the dataset
positive_count = target_counts[1]
negative_count = target_counts[0]
total_count = positive_count + negative_count
positive_percentage = (positive_count / total_count) * 100
negative_percentage = (negative_count / total_count) * 100
# Display the percentages of positive and negative examples
print(f"Percentage of positive examples: {positive_percentage:.2f}%")
print(f"Percentage of negative examples: {negative_percentage:.2f}%")
class_labels = ["No Default","Default"]
# Create a transformer to select numerical or categorical columns from a DataFrame
# (a hand-rolled alternative to scikit-learn's ColumnTransformer)
# Import necessary libraries
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns].values
# Identify numerical and categorical columns in the train dataset
num_cols = train_data.select_dtypes(include=["float64", "int64"]).columns.tolist()
cat_cols = train_data.select_dtypes(include=["object"]).columns.tolist()
# Remove the target and ID columns from the numerical columns list
num_cols.remove("TARGET")
num_cols.remove("SK_ID_CURR")
categorical_pipeline = Pipeline([
('category_selector', ColumnSelector(cat_cols)),
('category_imputer', SimpleImputer(strategy='most_frequent')),
('one_hot_encoder', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
numerical_pipeline = Pipeline([
('number_selector', ColumnSelector(num_cols)),
('number_imputer', SimpleImputer(strategy='mean')),
('standard_scaler', StandardScaler()),
])
data_preparation_pipeline = FeatureUnion(transformer_list=[
("numerical_pipeline", numerical_pipeline),
("categorical_pipeline", categorical_pipeline),
])
The numerical and categorical pipelines were combined using FeatureUnion to form the data preparation pipeline. FeatureUnion combines multiple pipelines so that numerical and categorical features are handled simultaneously; the outputs of both pipelines are concatenated into a single feature matrix. This prepared feature set then feeds the modeling pipeline, while the data itself is split into training and validation sets using train_test_split.
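A self-contained sketch of the same pattern on toy data (the column names income and gender are hypothetical) shows how FeatureUnion concatenates the two pipelines' outputs column-wise:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns].values

# Toy frame with one numerical and one categorical column, each with a gap
toy = pd.DataFrame({"income": [100.0, np.nan, 300.0],
                    "gender": ["M", "F", np.nan]})

union = FeatureUnion([
    ("num", Pipeline([("sel", ColumnSelector(["income"])),
                      ("imp", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())])),
    ("cat", Pipeline([("sel", ColumnSelector(["gender"])),
                      ("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))])),
])
out = union.fit_transform(toy)
print(out.shape)  # 3 rows; 1 scaled numeric column + 2 one-hot gender columns
```

Note that because the encoder's output is sparse, the combined result is a sparse matrix; its shape is still rows by total concatenated columns.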
(data_preparation_pipeline)
# Combine the numerical and categorical features into a single list
selected_features = num_cols + cat_cols + ["SK_ID_CURR"]
# Compute the total number of features and their breakdown by type
num_features = len(num_cols)
cat_features = len(cat_cols)
tot_features = f"{len(selected_features)}: Num:{num_features}, Cat:{cat_features}"
tot_features
X_kaggle_test= datasets["application_test"]
# Import necessary libraries
from sklearn.model_selection import train_test_split
# Split the train dataset into train, validation, and test sets
y_train = train_data['TARGET']
X_train = train_data[selected_features]
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
X_train,
y_train,
test_size=0.15,
random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
X_train_valid,
y_train_valid,
test_size=0.15,
random_state=42
)
# Filter the selected features in the Kaggle test dataset
X_kaggle_test = X_kaggle_test[selected_features]
# Display the shapes of the resulting datasets
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X Kaggle test shape: {X_kaggle_test.shape}")
Every trial in this project is identified by its pipeline configuration.
We formulate distinct experiments based on the following criteria:
all normal features
undersampled features
baseline machine learning models.
Before embarking on model selection through model comparison and model design, we initialize a logbook and establish a precise loss-metric framework. This allows systematic tracking of model performance across iterations and a comprehensive evaluation of model efficacy against predefined objectives. Such a structured approach is essential for robust analyses and for identifying model configurations that deliver strong predictive power and generalizability.
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name",
                                   "Train Acc", "Valid Acc", "Test Acc",
                                   "Train AUC", "Valid AUC", "Test AUC",
                                   "Train F1 Score", "Valid F1 Score", "Test F1 Score"])
Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target. The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC, summarizing the information contained in the ROC curve as a single number.
from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
Accuracy: the proportion of correctly classified data instances relative to the total number of instances.
Log loss: a metric reflecting how close the predicted probability is to the true value (typically 0 or 1 in binary classification). The log-loss value grows as the predicted probability deviates further from the true value.
Precision: the ratio of true positives to the sum of true positives and false positives.
Recall: the fraction of positive instances that are correctly identified as positive by the model; equivalent to the TPR (true positive rate).
F1 score: the harmonic mean of precision and recall, taking into account both false positives and false negatives. It is a useful metric for evaluating models on imbalanced datasets.
Confusion matrix: a tabular representation with one axis for actual values and the other for predicted values; for binary classification it is a 2x2 matrix, commonly used to assess a classifier's performance.
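These metrics can all be computed with scikit-learn; a small worked example on toy labels (values chosen only for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, precision_score, recall_score)

# Toy labels and predictions, chosen only for illustration
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
y_prob = np.array([0.2, 0.6, 0.7, 0.9])  # predicted P(y = 1)

print(accuracy_score(y_true, y_pred))    # 3 of 4 correct -> 0.75
print(precision_score(y_true, y_pred))   # TP=2, FP=1 -> 2/3
print(recall_score(y_true, y_pred))      # TP=2, FN=0 -> 1.0
print(f1_score(y_true, y_pred))          # harmonic mean of the two -> 0.8
print(log_loss(y_true, y_prob))          # penalizes confident wrong probabilities
print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
```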
To establish a baseline, we run a subset of preprocessed features through our pipeline, using a logistic regression model.
from IPython.display import Image
Image(filename='linear_block.png')
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
data = [num_features, cat_features]
labels = ['Numerical Features', 'Categorical Features']
fig, ax = plt.subplots()
bars = ax.bar(labels, data, color=['#0072B2', '#E69F00'], edgecolor='black')
# Style the bars and add a count label above each one
for bar in bars:
    bar.set_edgecolor('gray')
    bar.set_linewidth(1)
    bar.set_zorder(0)
    height = bar.get_height()
    ax.annotate(f'{height:.0f}', xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3), textcoords='offset points', ha='center', va='bottom',
                fontsize=12, fontweight='bold')
# Customize the axis labels and ticks
ax.set_xlabel('Data Type', fontsize=14, fontweight='bold')
ax.set_ylabel('Number of features ', fontsize=14, fontweight='bold')
ax.tick_params(axis='both', labelsize=12)
# Customize the plot background
ax.set_facecolor('#F0F0F0')
fig.set_facecolor('#F0F0F0')
ax.spines['bottom'].set_color('gray')
ax.spines['left'].set_color('gray')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
%%time
# Import necessary libraries
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Fit a logistic regression model to the train dataset
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_preparation_pipeline),
("linear", LogisticRegression())
])
model = full_pipeline_with_predictor.fit(X_train, y_train)
# Display the time taken to fit the model
print("Model trained successfully.\n")
# Import necessary libraries
from sklearn.metrics import accuracy_score
# Compute the training accuracy of the model
train_acc = accuracy_score(y_train, model.predict(X_train))
# Display the training accuracy
print(f"Training accuracy: {train_acc:.3f}")
Cross-validation is a methodology for assessing the efficacy of a machine learning model: the data is partitioned into multiple subsets, or "folds," and the model is repeatedly trained and evaluated on different combinations of those folds.
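A minimal sketch of the idea on synthetic data (a toy stand-in, not the HCDR frames), scoring each fold by AUC:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, class-imbalanced stand-in for the HCDR training set
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# Each of the 3 folds serves once as the held-out set while the model is
# trained on the rest; stratification keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.shape)  # one AUC per fold -> (3,)
```

Stratified splits matter here because with a roughly 9:1 class ratio, an unstratified fold could end up with very few positive examples.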
from sklearn.model_selection import ShuffleSplit
cvSplits = ShuffleSplit(n_splits=3, test_size=0.7, random_state=42)
# Import necessary libraries
from sklearn.model_selection import StratifiedKFold
# Create a StratifiedKFold object to generate cross-validation splits
cv_splits = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
import time
def pct(x):
    return round(100 * x, 3)
# Start measuring time
start = time.time()
# Fit the model
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)
# Calculate cross-validation scores
from sklearn.model_selection import cross_val_score
logit_scores = cross_val_score(full_pipeline_with_predictor, X_train, y_train, cv=cvSplits)
logit_score_train = pct(logit_scores.mean())
# Measure training time
train_time = np.round(time.time() - start, 4)
# Start measuring test prediction time
start = time.time()
# Calculate test score
logit_score_test = full_pipeline_with_predictor.score(X_test, y_test)
# Measure test prediction time
test_time = np.round(time.time() - start, 4)
# Print the test accuracy
print(f"Test Accuracy: {logit_score_test * 100:.3f}%")
# Import necessary libraries
from sklearn.metrics import roc_auc_score
# Compute the training AUC score of the model
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
# Display the training AUC score
print(f"Training AUC score: {train_auc:.3f}")
A confusion matrix is a tabulation that summarizes the performance of a classification model by comparing the predicted labels with the actual labels. For a binary classification problem, the matrix contains true positive, false positive, false negative, and true negative counts. True positives and true negatives are correct predictions, while false positives and false negatives are errors.
Creating a confusion-matrix function for the baseline model.
import numpy as np
from sklearn.metrics import confusion_matrix
def confusion_matrix_normalized(model, X_train, y_train, X_test, y_test):
    # Predict on train and test data
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    # Calculate row-normalized confusion matrices for train and test data
    cm_train = confusion_matrix(y_train, y_pred_train, normalize='true')
    cm_test = confusion_matrix(y_test, y_pred_test, normalize='true')
    return cm_train, cm_test
# Compute normalized confusion matrices for the model on the training and test sets
cm_train, cm_test = confusion_matrix_normalized(model, X_train, y_train, X_test, y_test)
import seaborn as sns
import matplotlib.pyplot as plt
# Set the class labels for the confusion matrix
class_labels = ["0", "1"]
# Compute normalized confusion matrices for the model on the training and test sets
cm_train, cm_test = confusion_matrix_normalized(model, X_train, y_train, X_test, y_test)
# Create a figure with two subplots for the confusion matrices
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(23, 8))
# Plot the normalized confusion matrix for the training set
sns.set(font_scale=1.2)
sns.heatmap(
cm_train,
vmin=0,
vmax=1,
annot=True,
cmap="BuPu",
xticklabels=class_labels,
yticklabels=class_labels,
ax=ax1,
)
ax1.set_xlabel("Predicted", fontsize=15)
ax1.set_ylabel("True", fontsize=15)
ax1.set_title("Train", fontsize=18)
# Plot the normalized confusion matrix for the test set
sns.set(font_scale=1.2)
sns.heatmap(
cm_test,
vmin=0,
vmax=1,
annot=True,
cmap="YlOrRd",
xticklabels=class_labels,
yticklabels=class_labels,
ax=ax2,
)
ax2.set_xlabel("Predicted", fontsize=15)
ax2.set_ylabel("True", fontsize=15)
ax2.set_title("Test", fontsize=18)
# Display the plot
plt.show()
from sklearn.metrics import f1_score
pred = model.predict(X_test)
plt.hist(pred, color='tab:blue', alpha=0.7, edgecolor='black')
f1_train = f1_score(y_train, model.predict(X_train))
f1_valid = f1_score(y_valid, model.predict(X_valid))
f1_test = f1_score(y_test, model.predict(X_test))
print("F1 Score for Test set: ", f1_test)
plt.title('Histogram of Predictions on Test Set', fontsize=16)
plt.xlabel('Prediction', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(axis='y', alpha=0.4)
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
exp_name = f"Baseline_Model-1:Logistic Regression with {len(selected_features)}_features"
# Model Training and Validation
model.fit(X_train, y_train)
# Model Predictions
train_preds = model.predict_proba(X_train)[:, 1]
valid_preds = model.predict_proba(X_valid)[:, 1]
test_preds = model.predict_proba(X_test)[:, 1]
# Compute Metrics
train_fpr, train_tpr, _ = roc_curve(y_train, train_preds)
valid_fpr, valid_tpr, _ = roc_curve(y_valid, valid_preds)
test_fpr, test_tpr, _ = roc_curve(y_test, test_preds)
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, model.predict(X_train)),
accuracy_score(y_valid, model.predict(X_valid)),
accuracy_score(y_test, model.predict(X_test)),
roc_auc_score(y_train, train_preds),
roc_auc_score(y_valid, valid_preds),
roc_auc_score(y_test, test_preds),
f1_train, f1_valid, f1_test],
4))
# Plot ROC Curve
plt.plot(train_fpr, train_tpr, label="Train ROC Curve")
plt.plot(valid_fpr, valid_tpr, label="Validation ROC Curve")
plt.plot(test_fpr, test_tpr, label="Test ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
expLog
The selected features were chosen based on their potential predictive power for the target variable "TARGET". Specifically, we performed correlation analysis on the application train data and selected the columns with the highest correlation to the target variable. Some of these features, such as "DAYS_BIRTH", "DAYS_EMPLOYED", and "EXT_SOURCE_1", are related to the age and income of the borrower, which are likely to affect their ability to repay a loan. Other features, such as "NAME_EDUCATION_TYPE" and "OCCUPATION_TYPE", may also be related to the borrower's financial situation and ability to repay a loan. Finally, "CODE_GENDER", "FLAG_OWN_CAR", and "FLAG_OWN_REALTY" may also provide insight into the borrower's overall financial stability and responsibility.
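A minimal standalone sketch of this kind of correlation-based screening, using toy data and a hypothetical `top_correlated_features` helper (not the project's exact code):

```python
import numpy as np
import pandas as pd

def top_correlated_features(df: pd.DataFrame, target: str = "TARGET", k: int = 10) -> list:
    """Return the k numeric columns most correlated (in absolute value) with the target."""
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()

# Toy frame standing in for application_train
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "TARGET": rng.integers(0, 2, 500),
    "DAYS_BIRTH": rng.normal(size=500),
    "NOISE": rng.normal(size=500),
})
# Make one column strongly related to the target so it ranks first
toy["EXT_SOURCE_1"] = toy["TARGET"] * 0.8 + rng.normal(scale=0.2, size=500)

print(top_correlated_features(toy, k=2))
```

Categorical columns such as "NAME_EDUCATION_TYPE" do not appear in a plain numeric correlation, which is why they were assessed separately.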
# Define numeric and categorical attributes
numeric_attributes = [
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_EMPLOYED', 'DAYS_BIRTH',
'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3'
]
categorical_attributes = [
'CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE'
]
# Combine selected features
selected_features = numeric_attributes + categorical_attributes + ["SK_ID_CURR"]
total_features = f"{len(selected_features)}: Num:{len(numeric_attributes)}, Cat:{len(categorical_attributes)}"
print(total_features)
# Split the provided training data into training, validation, and test sets
y_train = train_data['TARGET']
X_train = train_data[selected_features]
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_kaggle_test = X_kaggle_test[selected_features]
# Print dataset shapes
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X_kaggle_test shape: {X_kaggle_test.shape}")
# Define categorical pipeline
categorical_pipeline = Pipeline([
('category_selector', ColumnSelector(categorical_attributes)),
('category_imputer', SimpleImputer(strategy='most_frequent')),
('one_hot_encoder', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
# Define numeric pipeline
numeric_pipeline = Pipeline([
('number_selector', ColumnSelector(numeric_attributes)),
('number_imputer', SimpleImputer(strategy='mean')),
('standard_scaler', StandardScaler()),
])
# Combine pipelines
data_preparation_pipeline = FeatureUnion(transformer_list=[
("numeric_pipeline", numeric_pipeline),
("categorical_pipeline", categorical_pipeline),
])
# Set random seed
np.random.seed(42)
# Define the full pipeline with predictor
full_pipeline_with_predictor = Pipeline([
("preparation", data_preparation_pipeline),
("linear", LogisticRegression())
])
# Fit the model
model = full_pipeline_with_predictor.fit(X_train, y_train)
# Calculate F1 scores
f1_train = f1_score(y_train, model.predict(X_train))
f1_valid = f1_score(y_valid, model.predict(X_valid))
f1_test = f1_score(y_test, model.predict(X_test))
# Add results to the experiment log
exp_name = f"Baseline_Model-2:Logistic Regression with {len(selected_features)}_features after correlation analysis"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, model.predict(X_train)),
accuracy_score(y_valid, model.predict(X_valid)),
accuracy_score(y_test, model.predict(X_test)),
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
f1_train, f1_valid, f1_test],
4))
# Plot ROC curves for the train, validation, and test sets
train_fpr, train_tpr, _ = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
valid_fpr, valid_tpr, _ = roc_curve(y_valid, model.predict_proba(X_valid)[:, 1])
test_fpr, test_tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
plt.plot(train_fpr, train_tpr, label='Train ROC Curve')
plt.plot(valid_fpr, valid_tpr, label='Validation ROC Curve')
plt.plot(test_fpr, test_tpr, label='Test ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
expLog
datasets.keys()
train_dataset = datasets["undersampled_application_train"] # primary dataset
merge_all_data = True
if merge_all_data:
# Join/Merge in prevApps Data
train_dataset = train_dataset.merge(
datasets["previous_application_agg"],
how='left',
on='SK_ID_CURR'
)
# Join/Merge in Installments Payments Data
train_dataset = train_dataset.merge(
datasets["installments_payments_agg"],
how='left',
on="SK_ID_CURR"
)
# Join/Merge in Credit Card Balance Data
train_dataset = train_dataset.merge(
datasets["credit_card_balance_agg"],
how='left',
on="SK_ID_CURR"
)
datasets["und_4_datasets"] = train_dataset
train_dataset.shape
X_kaggle_test = datasets["application_test"]
if merge_all_data:
# Join/Merge in prevApps Data
X_kaggle_test = X_kaggle_test.merge(
datasets["previous_application_agg"],
how='left',
on='SK_ID_CURR'
)
# Join/Merge in Installments Payments Data
X_kaggle_test = X_kaggle_test.merge(
datasets["installments_payments_agg"],
how='left',
on="SK_ID_CURR"
)
# Join/Merge in Credit Card Balance Data
X_kaggle_test = X_kaggle_test.merge(
datasets["credit_card_balance_agg"],
how='left',
on="SK_ID_CURR"
)
Train, validation, and test sets (and the leakage problem we mentioned previously):
Let's look at a small use case that shows how to deal with this.
If the encoder is fit on the training set and then used to transform the test set, a ValueError can occur because the test set contains new, previously unseen unique values that the encoder does not know how to handle. To use both the transformed training and test sets in machine learning algorithms, they need the same number of columns. This problem can be solved with the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, ignores previously unseen values when transforming the test set.
Here is an example of that in action:
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
When working with training, validation, and test sets, the OneHotEncoder is fit on the training set only. Each unique value of each feature in the training set produces a new column, and the encoding drops the original column names and values, leaving a NumPy array as output. If the fitted encoder is then used to transform the test set, a ValueError can arise because the test set contains unique values the encoder has never seen. Since the transformed training and test sets must contain the same number of columns to be used in machine learning algorithms, this is resolved with the option handle_unknown='ignore' of the OneHotEncoder, which skips unfamiliar values when transforming the test set.
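A minimal standalone sketch of that behavior, using toy category values rather than the HCDR data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["Cash loans"], ["Revolving loans"]])
test = np.array([["Cash loans"], ["Car loans"]])  # "Car loans" never appears in train

# Fit on the training set only; without handle_unknown="ignore",
# transform(test) would raise a ValueError on the unseen category.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)
dense = ohe.transform(test).toarray()
print(dense)  # the unseen category becomes an all-zero row, keeping column counts aligned
```

The transformed test set keeps exactly the columns learned from the training set, so downstream models see a consistent feature space.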
We elected to incorporate supplementary tables:
Additional tables, namely Previous Application, Installments Payments, and Credit Card Balance, were included alongside the application_train dataset for HCDR preprocessing.
This decision was made because using only the application_train data yielded low accuracy, a consequence of the imbalanced data, which also required undersampling of non-defaulters.
Correlation analysis showed that the other tables contain significant features.
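The undersampling mentioned above can be sketched as follows, using toy data and a hypothetical `undersample_majority` helper (the project's actual undersampled tables were prepared earlier in the notebook):

```python
import numpy as np
import pandas as pd

def undersample_majority(df: pd.DataFrame, target: str = "TARGET", seed: int = 42) -> pd.DataFrame:
    """Randomly downsample the majority class (non-defaulters) to a 1:1 class ratio."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0].sample(n=len(minority), random_state=seed)
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)  # shuffle rows

# Toy frame with roughly 8% positives, mimicking the HCDR class imbalance
rng = np.random.default_rng(0)
toy = pd.DataFrame({"TARGET": (rng.random(1000) < 0.08).astype(int)})
balanced = undersample_majority(toy)
print(balanced["TARGET"].value_counts())
```

Undersampling trades away majority-class rows for a balanced training signal, which is why accuracy drops but F1 improves in the undersampled experiments below.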
The data is split into training, validation, and test sets, then preprocessed with separate categorical and numerical pipelines that are combined into a complete data-preparation pipeline feeding our machine learning model pipeline.
# Define numerical and categorical attribute lists
numerical_attributes = []
categorical_attributes = []
# Iterate over each feature in the train dataset
for feature in train_dataset.columns:
# Determine if the feature is numerical or categorical
if train_dataset[feature].dtype == np.float64 or train_dataset[feature].dtype == np.int64:
numerical_attributes.append(feature)
else:
categorical_attributes.append(feature)
# Remove TARGET and SK_ID_CURR from the numerical attributes list
numerical_attributes.remove('TARGET')
numerical_attributes.remove('SK_ID_CURR')
# Define the categorical pipeline
categorical_pipeline = Pipeline([
('selector', ColumnSelector(categorical_attributes)),
('imputer', SimpleImputer(strategy='most_frequent')),
('one_hot_encoder', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
# Define the numerical pipeline
numerical_pipeline = Pipeline([
('selector', ColumnSelector(numerical_attributes)),
('imputer', SimpleImputer(strategy='mean')),
('standard_scaler', StandardScaler()),
])
# Combine the categorical and numerical pipelines into a single data preparation pipeline
data_prep_pipeline = FeatureUnion(transformer_list=[
("numerical_pipeline", numerical_pipeline),
("categorical_pipeline", categorical_pipeline),
])
# Define the final list of selected features
selected_features = numerical_attributes + categorical_attributes + ["SK_ID_CURR"]
# Generate a string summarizing the total number of selected features
tot_features = f"{len(selected_features)} features selected: Numerical={len(numerical_attributes)}, Categorical={len(categorical_attributes)}"
# Split the provided training data into training and validation and test
# The kaggle evaluation test set has no labels
from sklearn.model_selection import train_test_split
y_train = train_dataset['TARGET']
X_train = train_dataset[selected_features]
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
X_kaggle_test= X_kaggle_test[selected_features]
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X_kaggle_test shape: {X_kaggle_test.shape}")
To establish a baseline, we use the processed features produced by the pipeline, with a logistic regression model serving as baseline model 2. We use two undersampled datasets here.
Here is the first undersampled dataset, with 180 features.
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
data = [len(numerical_attributes),len(categorical_attributes)]
labels = ['Numerical Features ', 'Categorical Features']
fig, ax = plt.subplots()
bars = ax.bar(labels, data, color=['#0072B2', '#E69F00'], edgecolor='black')
# Add shadows to the bars
for bar in bars:
bar.set_edgecolor('gray')
bar.set_linewidth(1)
bar.set_zorder(0)
# Add labels to the bars
height = bar.get_height()
ax.annotate(f'{height:.0f}', xy=(bar.get_x() + bar.get_width() / 2, height),
xytext=(0, 3), textcoords='offset points', ha='center', va='bottom',
fontsize=12, fontweight='bold')
# Customize the axis labels and ticks
ax.set_xlabel('Data Type', fontsize=14, fontweight='bold')
ax.set_ylabel('Number of features ', fontsize=14, fontweight='bold')
ax.tick_params(axis='both', labelsize=12)
# Customize the plot background
ax.set_facecolor('#F0F0F0')
fig.set_facecolor('#F0F0F0')
ax.spines['bottom'].set_color('gray')
ax.spines['left'].set_color('gray')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
from sklearn.metrics import accuracy_score
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_preparation_pipeline),
("linear", LogisticRegression())
])
# Fit the pipeline to the training data and create a pipelined model
full_pipeline_with_predictor.fit(X_train, y_train)
pipelined_model = full_pipeline_with_predictor
# Compute the accuracy score of the trained model on the training data
accuracy = np.round(accuracy_score(y_train, pipelined_model.predict(X_train)), 3)
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
import time
np.random.seed(42)
# Split the training data into 3 shuffled folds for cross-validation
cv_splits = ShuffleSplit(n_splits=3, test_size=0.7, random_state=42)
# Fit the pipelined model to the training data and time it
start_train = time.time()
pipelined_model = full_pipeline_with_predictor.fit(X_train, y_train)
train_time = np.round(time.time() - start_train, 4)
# Compute the cross-validation accuracy score of the pipelined model on the training data
cross_val_scores = cross_val_score(pipelined_model, X_train, y_train, cv=cv_splits)
cross_val_score_train = np.round(cross_val_scores.mean(), 3)
# Time and score the pipelined model on the test data
start_test = time.time()
cross_val_score_test = pipelined_model.score(X_test, y_test)
test_time = np.round(time.time() - start_test, 4)
roc_auc_score(y_train, pipelined_model.predict_proba(X_train)[:, 1])
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve
# Compute the F1 score for the training, validation, and test sets
f1_train = f1_score(y_train, pipelined_model.predict(X_train))
f1_valid = f1_score(y_valid, pipelined_model.predict(X_valid))
f1_test = f1_score(y_test, pipelined_model.predict(X_test))
# Define an experiment name and log the results of the experiment
experiment_name = f"Baseline_Model-3:Logistic Regression with_undersampling_one {len(selected_features)}_features"
experiment_results = [
accuracy_score(y_train, pipelined_model.predict(X_train)),
accuracy_score(y_valid, pipelined_model.predict(X_valid)),
accuracy_score(y_test, pipelined_model.predict(X_test)),
roc_auc_score(y_train, pipelined_model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, pipelined_model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, pipelined_model.predict_proba(X_test)[:, 1]),
f1_train,
f1_valid,
f1_test,
]
logged_experiment = [f"{experiment_name}"] + list(np.round(experiment_results, 4))
expLog.loc[len(expLog)] = logged_experiment
# ROC Curve
train_fpr, train_tpr, _ = roc_curve(y_train, pipelined_model.predict_proba(X_train)[:, 1])
valid_fpr, valid_tpr, _ = roc_curve(y_valid, pipelined_model.predict_proba(X_valid)[:, 1])
test_fpr, test_tpr, _ = roc_curve(y_test, pipelined_model.predict_proba(X_test)[:, 1])
plt.plot(train_fpr, train_tpr, label='Train ROC Curve')
plt.plot(valid_fpr, valid_tpr, label='Validation ROC Curve')
plt.plot(test_fpr, test_tpr, label='Test ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
expLog
Here is the second undersampled dataset, with the same number of features as above.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, Lasso, SGDClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
# Merge datasets
train_dataset = datasets["undersampled_application_train_2"]
train_dataset = train_dataset.merge(datasets["previous_application_agg"], how="left", on="SK_ID_CURR")
train_dataset = train_dataset.merge(datasets["installments_payments_agg"], how="left", on="SK_ID_CURR")
train_dataset = train_dataset.merge(datasets["credit_card_balance_agg"], how="left", on="SK_ID_CURR")
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
data = [len(numerical_attributes),len(categorical_attributes)]
labels = ['Numerical Features ', 'Categorical Features']
fig, ax = plt.subplots()
bars = ax.bar(labels, data, color=['#0072B2', '#E69F00'], edgecolor='black')
# Add shadows to the bars
for bar in bars:
bar.set_edgecolor('gray')
bar.set_linewidth(1)
bar.set_zorder(0)
# Add labels to the bars
height = bar.get_height()
ax.annotate(f'{height:.0f}', xy=(bar.get_x() + bar.get_width() / 2, height),
xytext=(0, 3), textcoords='offset points', ha='center', va='bottom',
fontsize=12, fontweight='bold')
# Customize the axis labels and ticks
ax.set_xlabel('Data Type', fontsize=14, fontweight='bold')
ax.set_ylabel('Number of features ', fontsize=14, fontweight='bold')
ax.tick_params(axis='both', labelsize=12)
# Customize the plot background
ax.set_facecolor('#F0F0F0')
fig.set_facecolor('#F0F0F0')
ax.spines['bottom'].set_color('gray')
ax.spines['left'].set_color('gray')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
from IPython.display import Image
Image(filename='pipe_block.png')
For under-sampled dataset 2, three baseline models are included in the pipeline: Model 4 (Lasso regression), Model 5 (SGD Lasso regression), and Model 6 (logistic regression). By including these models in the pipeline, we hope the resulting predictions will be reliable and useful in addressing the challenges posed by the imbalanced dataset.
from sklearn.metrics import auc
# Define a column transformer to separate categorical and numeric features
num_transformer = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
])
cat_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
('num', num_transformer, numerical_attributes),
('cat', cat_transformer, categorical_attributes)
])
# Define the Logistic Regression pipeline
logreg_pipeline = Pipeline([
('preprocessor', preprocessor),
('logreg', LogisticRegression())
])
# Define the Lasso Regression pipeline
lasso_pipeline = Pipeline([
('preprocessor', preprocessor),
('lasso', Lasso(alpha=0.1)),
])
# Define the SGD Lasso Regression pipeline
sgd_lasso_pipeline = Pipeline([
('preprocessor', preprocessor),
('sgd', SGDClassifier(loss='squared_hinge', penalty='l1', alpha=0.1)),
])
# Define the pipeline for each model
models = {
'Baseline_Model-4:Lasso Regression': lasso_pipeline,
'Baseline_Model-5:SGD Lasso Regression': sgd_lasso_pipeline,
'Baseline_Model-6:Logistic Regression': logreg_pipeline,
}
# Train and evaluate each model
for model_name, model in models.items():
print(f'Training and evaluating {model_name}...')
trained_model = model.fit(X_train, y_train)
f1_train = f1_score(y_train, trained_model.predict(X_train) > 0.5)
f1_valid = f1_score(y_valid, trained_model.predict(X_valid) > 0.5)
f1_test = f1_score(y_test, trained_model.predict(X_test) > 0.5)
exp_name = f'{model_name}_with_undersampling_Two {len(selected_features)}_features'
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
[accuracy_score(y_train, trained_model.predict(X_train) > 0.5),
accuracy_score(y_valid, trained_model.predict(X_valid) > 0.5),
accuracy_score(y_test, trained_model.predict(X_test) > 0.5),
roc_auc_score(y_train, trained_model.predict(X_train)),
roc_auc_score(y_valid, trained_model.predict(X_valid)),
roc_auc_score(y_test, trained_model.predict(X_test)),
f1_train, f1_valid, f1_test],
4))
print(f'{model_name} training and evaluation complete.\n')
# Compute confusion matrices for train and test sets
cm_train = confusion_matrix(y_train, trained_model.predict(X_train) > 0.5)
cm_test = confusion_matrix(y_test, trained_model.predict(X_test) > 0.5)
# Plot the confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].set_title(f'Train set confusion matrix for {model_name}')
sns.heatmap(cm_train, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[1].set_title(f'Test set confusion matrix for {model_name}')
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues', ax=axes[1])
plt.show()
if model_name=='Baseline_Model-4:Lasso Regression':
from sklearn.metrics import precision_recall_curve, average_precision_score
# Train the Lasso model
lasso_pipeline.fit(X_train, y_train)
# Compute predicted scores for the test set (Lasso has no predict_proba;
# its continuous predictions are used as scores here)
y_proba = lasso_pipeline.predict(X_test)
# Compute the precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Compute the average precision score
average_precision = average_precision_score(y_test, y_proba)
# Plot the precision-recall curve
plt.plot(recall, precision, color='navy', lw=2, label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Lasso Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
plt.legend(loc="lower left")
plt.show()
if model_name == 'Baseline_Model-5:SGD Lasso Regression':
# Compute predicted scores for the test set
y_score = trained_model.decision_function(X_test)
# Compute precision and recall for various thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
# Compute AUC-PR score
auc_score = auc(recall, precision)
# Plot the precision-recall curve
plt.plot(recall, precision, label=f'{model_name}, AUC-PR = {auc_score:.4f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.show()
if model_name=='Baseline_Model-6:Logistic Regression':
# ROC Curve
train_fpr, train_tpr, _ = roc_curve(y_train, pipelined_model.predict_proba(X_train)[:, 1])
valid_fpr, valid_tpr, _ = roc_curve(y_valid, pipelined_model.predict_proba(X_valid)[:, 1])
test_fpr, test_tpr, _ = roc_curve(y_test, pipelined_model.predict_proba(X_test)[:, 1])
plt.plot(train_fpr, train_tpr, label='Train ROC Curve')
plt.plot(valid_fpr, valid_tpr, label='Validation ROC Curve')
plt.plot(test_fpr, test_tpr, label='Test ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
expLog
# Predict class labels for the test set
pred = pipelined_model.predict(X_test)
# Create histogram of predicted class labels with a new color scheme
plt.figure(figsize=(8, 6))
sns.histplot(pred, kde=False, color="#5C3C92", alpha=0.8)
plt.xlabel("Predicted Class Label", fontsize=15)
plt.ylabel("Count", fontsize=15)
plt.title("Histogram of Predicted Class Labels", fontsize=18)
# Compute the F1 score for the test set and print it
f1 = f1_score(y_test, pred)
print("F1 Score: ", f1)
To prepare the submission file, a probability prediction for the TARGET variable must be provided for each SK_ID_CURR in the test set. The file should include a header and adhere to the required structure.
X_kaggle_test
# Print the X_kaggle_test dataframe nicely using the `to_string()` method
#print(X_kaggle_test.to_string(index=False))
test_class_scores = pipelined_model.predict_proba(X_kaggle_test)[:, 1]
from pprint import pprint
pprint(test_class_scores[0:25])
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']].copy()  # copy so the TARGET assignment below does not trigger SettingWithCopyWarning
submit_df['TARGET'] = test_class_scores
submit_df.head()
submit_df.to_csv("submission.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline submission"
In this project, we aim to predict the probability of default for Home Credit clients based on various features derived from historical data. Home Credit provides loans to clients but faces challenges in assessing the creditworthiness of clients with little or no credit history. Our primary objective is to use historical data from multiple sources and construct a robust machine learning model that can accurately predict the risk of default. To achieve this, we preprocessed the data, engineered features, performed EDA, and experimented with a range of machine learning algorithms, such as logistic regression, Lasso, and SGD Lasso. We fine-tuned these models using feature selection and selected the best-performing model. Key metrics such as ROC AUC, F1 score, and PR AUC are employed to assess the effectiveness of our models in predicting default probabilities. Our experiments involved comparing these models' performance and identifying the most effective pipeline. The best accuracy (91.94%) was achieved using baseline logistic regression with full batch gradient descent; however, its low F1 score (0.0272) suggests imbalanced class performance. By implementing the best model, Home Credit will be able to make more informed lending decisions, minimize unpaid loans, and promote financial services for individuals with limited access to banking, ultimately fostering financial inclusion for underserved populations.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
The data used in this project is sourced from a financial institution (Home Credit) that provides loans to customers, and it is available on Kaggle. The dataset comprises various tables with information about the customers, their loan applications, credit history, and other financial information.
There are 7 different sources of data:
| S. No | Table Name | Rows | Features | Numerical Features | Categorical Features | Megabytes |
|---|---|---|---|---|---|---|
| 1 | application_train | 307,511 | 122 | 106 | 16 | 158MB |
| 2 | application_test | 48,744 | 121 | 105 | 16 | 25MB |
| 3 | bureau | 1,716,428 | 17 | 14 | 3 | 162MB |
| 4 | bureau_balance | 27,299,925 | 3 | 2 | 1 | 358MB |
| 5 | credit_card_balance | 3,840,312 | 23 | 22 | 1 | 405MB |
| 6 | installments_payments | 13,605,401 | 8 | 8 | 0 | 690MB |
| 7 | previous_application | 1,670,214 | 37 | 21 | 16 | 386MB |
| 8 | POS_CASH_balance | 10,001,358 | 8 | 7 | 1 | 375MB |
The data download also includes a data dictionary, named HomeCredit_columns_description.csv, which contains information (i.e., metadata) about all fields present in all of the above tables.
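As a sketch of how such a data dictionary can be queried, with a tiny inline CSV standing in for the real file (whose exact column layout may differ):

```python
import pandas as pd
from io import StringIO

# Minimal stand-in for HomeCredit_columns_description.csv (the real file has more rows/columns)
csv = StringIO(
    "Table,Row,Description\n"
    "application_train,DAYS_BIRTH,Client's age in days at the time of application\n"
    "application_train,EXT_SOURCE_1,Normalized score from external data source\n"
)
data_dict = pd.read_csv(csv)

def describe(column: str) -> str:
    """Look up a column's description in the data dictionary."""
    hit = data_dict.loc[data_dict["Row"] == column, "Description"]
    return hit.iloc[0] if not hit.empty else "unknown column"

print(describe("DAYS_BIRTH"))
```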
The main steps to achieve this objective are:
Understand the data : We have multiple tables with different granularities. We start by examining the data dictionaries provided and understanding the relationships between tables. Identify the primary key for each table and determine how the tables can be joined.
Data preprocessing : We first clean and preprocess the data. We will perform EDA to understand the relationships between various features and their significance in predicting the target variable.
Feature engineering : For each secondary table (e.g., bureau, previous_application, etc.), we create new features that capture relevant information. This can involve calculating summary statistics like mean, median, min, max, and count for numeric columns, or counting occurrences for categorical columns. Consider creating interaction features or ratios between existing features if they make sense in the context of the problem.
Aggregate secondary tables : Group the secondary tables by the common key (usually SK_ID_CURR) and aggregate the features using relevant aggregation functions (e.g., sum, mean, count, etc.). This step will create a single-row summary for each customer in the secondary tables.
Merge primary and secondary tables : Combine the main dataset with the secondary datasets using appropriate join operations to create a comprehensive dataset that captures all relevant information about the customers. Ex. Merge the aggregated secondary tables with the primary table (application_train or application_test) using the common key (SK_ID_CURR). Perform left joins to ensure that you retain all records from the primary table.
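The aggregation and merge steps can be sketched as follows, using toy stand-ins for the real tables:

```python
import pandas as pd

# Toy stand-ins for previous_application and application_train
prev_app = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [1000.0, 3000.0, 500.0],
})
app_train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

# Aggregate the secondary table to one row per customer ...
prev_agg = prev_app.groupby("SK_ID_CURR").agg(
    PREV_AMT_CREDIT_MEAN=("AMT_CREDIT", "mean"),
    PREV_COUNT=("AMT_CREDIT", "count"),
).reset_index()

# ... then left-join onto the primary table so no applicants are dropped
merged = app_train.merge(prev_agg, how="left", on="SK_ID_CURR")
print(merged)
```

Customers with no rows in the secondary table (SK_ID_CURR 3 here) survive the left join with NaN aggregates, which the imputers in the preprocessing pipelines then fill.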
Dimensionality reduction : After merging, we may end up with a large number of features. Use feature selection techniques to remove irrelevant or redundant features. Some common methods include:
Correlation analysis : Remove highly correlated features to avoid multicollinearity.
Feature importance : Use algorithms like Random Forests or Gradient Boosting Machines to rank features based on their importance.
Recursive feature elimination (RFE) : Train a model and iteratively remove the least important features.
Lasso regression or Elastic Net : Use regularization methods to shrink coefficients of unimportant features to zero, effectively removing them.
Preprocess data : Scale the features, impute missing values, and handle categorical variables using one-hot encoding, label encoding, or target encoding.
Model selection and training : Choose suitable machine learning models, such as lasso regression, logistic regression, decision trees, random forests, gradient boosting machines (GBMs), and neural networks. Split the data into training and testing sets and train the models.
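As one example, correlation-based pruning can be sketched like this, with toy data and a hypothetical `drop_highly_correlated` helper:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from each pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({"A": a, "A_COPY": a * 2 + 0.01, "B": rng.normal(size=200)})
reduced = drop_highly_correlated(toy)
print(list(reduced.columns))
```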
Model evaluation : Evaluate the performance of the models using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. We will compare these models' performance and identify the best performing model based on these evaluation metrics.
Model optimization : Perform hyperparameter tuning and feature selection to optimize the model's performance. We will experiment with different combinations of features and hyperparameters to improve the model's predictive accuracy.
By implementing the best model, Home Credit will be able to make more informed lending decisions, minimize unpaid loans, and promote financial services for individuals with limited access to banking, ultimately fostering financial inclusion for underserved populations. The effectiveness of our models in predicting default probabilities will be assessed using key metrics such as ROC AUC, F1 Score, and Gini Coefficient. The corresponding public and private scores will also be evaluated to determine our model's performance.
Please present the results of the various experiments that you conducted. The results should be shown in a table or image. Try to include the different details for each experiment.
The machine learning project on Home Credit default risk used a dataset of 7 individual tables, containing a total of 104 numerical attributes and 16 categorical attributes. Six experiments were run, applying different techniques and algorithms to the data:

| Experiment | Test Accuracy | Test AUC | Test F1 |
|---|---|---|---|
| Logistic regression, full-batch gradient descent (baseline) | 0.9194 | 0.7436 | 0.0272 |
| Logistic regression, 15 selected features | 0.9159 | 0.7355 | 0.0120 |
| Logistic regression with undersampling | 0.7721 | 0.7382 | 0.3192 |
| Lasso regression with undersampling | 0.7560 | 0.7089 | 0.0000 |
| SGD (Stochastic Gradient Descent) lasso regression with undersampling | 0.6614 | 0.6052 | 0.4166 |
| Logistic regression with undersampling (second run) | 0.7792 | 0.6086 | 0.3783 |

Overall, the results highlight the varying performance of the different techniques and algorithms. The best accuracy, 91.94%, was achieved by the baseline logistic regression with full-batch gradient descent, but its very low F1 score of 0.0272 indicates that although the model's overall predictions look accurate, its performance across the imbalanced classes is poor, with a weak trade-off between precision and recall.
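The undersampling technique used in several of the experiments above can be sketched as follows: keep all minority-class (default) rows and an equal-sized random sample of majority rows, then fit on the balanced subset. This is a hedged sketch on synthetic data, not the project's exact code; the class ratio and model settings are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in (~92% non-default, ~8% default).
X, y = make_classification(n_samples=5000, weights=[0.92], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random undersampling: all minority rows + an equal-sized majority sample.
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
majority = rng.choice(np.flatnonzero(y_tr == 0), size=minority.size,
                      replace=False)
idx = np.concatenate([minority, majority])

# Train on the balanced subset, evaluate on the untouched imbalanced test set.
clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
print("test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("test F1 :", f1_score(y_te, clf.predict(X_te)))
```

Evaluating on the original imbalanced test set (never on the balanced subset) is what makes the accuracy drop while the F1 score rises, matching the pattern seen in the table above.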
Our project focused on predicting the probability of default for Home Credit clients using machine learning techniques. We built baseline machine learning pipelines, including logistic regression, lasso regression, and SGD, with feature engineering, hyperparameter optimization, and undersampling, and evaluated their performance using key metrics. The best accuracy (91.94%) was achieved by baseline logistic regression with full batch gradient descent; however, its low F1 score (0.0272) points to imbalanced class performance. Future work includes experimenting with other algorithms such as SVM, KNN, GBMs like XGBoost, and neural networks, with further feature engineering techniques and sampling methods, as well as incorporating domain-specific knowledge and expanding the dataset. Our project lays the foundation for Home Credit to make more informed lending decisions and promote financial inclusion for underserved populations.
Please provide a screenshot of your best kaggle submission. The screenshot should show the different details of the submission and not just the score.
# Display the screenshot of our best Kaggle submission
from IPython.display import Image
Image(filename='Kaggle.png')